# Data Schema — Flourishing Indicators Dataset (Geotweet Archive v2.0)

This document describes the structure and content of the four main dataset files distributed in this repository:

- `flourishingCountyMonth.(csv|parquet)`
- `flourishingCountyYear.(csv|parquet)`
- `flourishingStateMonth.(csv|parquet)`
- `flourishingStateYear.(csv|parquet)`

Each file provides **aggregated indicators of human flourishing dimensions** derived from 2.6 billion geolocated tweets (2010–2023).  
All files are in **long format**, meaning that each row corresponds to one combination of:

> { geographic unit × time period × well-being dimension }
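
Because every dimension sits on its own row, analyses that compare dimensions side by side usually need a wide layout first. The sketch below shows one way to pivot long-format rows using only the Python standard library. The sample rows and the helper name `pivot_wide` are illustrative assumptions, not part of the dataset or its tooling.

```python
from collections import defaultdict

# Hypothetical long-format rows (values are illustrative, not real data).
rows = [
    {"StateCounty": "01001", "year": 2015, "variable": "happiness", "stat": 0.412},
    {"StateCounty": "01001", "year": 2015, "variable": "optimism", "stat": 0.250},
    {"StateCounty": "01003", "year": 2015, "variable": "happiness", "stat": -0.100},
]


def pivot_wide(rows, keys=("StateCounty", "year"), value="stat"):
    """Group rows by (geographic unit, time period); map dimension -> value."""
    wide = defaultdict(dict)
    for row in rows:
        cell = tuple(row[k] for k in keys)
        wide[cell][row["variable"]] = row[value]
    return dict(wide)


wide = pivot_wide(rows)
```

For monthly tables, passing `keys=("StateCounty", "year", "month")` keeps one entry per county and month.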

---

## 🗺️ Geographic and Temporal Coverage

| File | Spatial resolution | Temporal resolution | Rows represent |
|------|--------------------|--------------------|----------------|
| `flourishingCountyMonth` | County (FIPS-5) | Month | `{county, year, month, dimension}` |
| `flourishingCountyYear`  | County (FIPS-5) | Year  | `{county, year, dimension}` |
| `flourishingStateMonth`  | State (FIPS-2)  | Month | `{state, year, month, dimension}` |
| `flourishingStateYear`   | State (FIPS-2)  | Year  | `{state, year, dimension}` |

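All FIPS codes follow U.S. Census Bureau conventions and are zero-padded (for example, `01` rather than `1`).  
`StateCounty` is the concatenation of the 2-digit state FIPS code and the 3-digit county code, which yields the 5-digit county FIPS code.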
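
Leading zeros are easy to lose when identifiers are read as integers. As a minimal sketch, the helpers below rebuild and split a 5-digit code. The function names `make_state_county` and `split_state_county` are hypothetical and not part of this repository.

```python
def make_state_county(state_fips, county_code):
    """Build a 5-digit county FIPS from a state FIPS and a county code."""
    state = str(state_fips).strip().zfill(2)    # restore a dropped leading zero
    county = str(county_code).strip().zfill(3)
    if len(state) != 2 or len(county) != 3 or not (state + county).isdigit():
        raise ValueError(f"invalid FIPS parts: {state_fips!r}, {county_code!r}")
    return state + county


def split_state_county(state_county):
    """Split a 5-digit county FIPS into (state FIPS, county code)."""
    code = str(state_county).strip().zfill(5)
    if len(code) != 5 or not code.isdigit():
        raise ValueError(f"invalid county FIPS: {state_county!r}")
    return code[:2], code[2:]
```

For example, `make_state_county(6, 37)` returns `"06037"`, and `split_state_county("6037")` recovers `("06", "037")`.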

---

## 📊 Indicator Definition

- **`stat`** – Average score of the specified flourishing dimension among tweets where that dimension is detected.  
  - Values lie in **[-1, +1]**, corresponding to negative, neutral, or positive expression levels.  
  - Zeros assigned at the tweet level are excluded when computing these means.
- **`stat_se`** – Standard deviation of `stat` (if computed; may be missing).
- **`salience`** – Fraction of tweets expressing the dimension within the cell; represents **relative prevalence**, *not* linguistic or cognitive salience.
- **`ntweets`** – Total number of tweets in the spatio-temporal cell.
- **`validtweets`** – Number of tweets where the given dimension is detected.
- **`natweets`** – Number of tweets where the dimension is absent (`ntweets − validtweets`).
- **`variable`** – Name of the flourishing dimension (e.g., `happiness`, `optimism`, `loneliness`, `jobsat`).
- **`FIPS`, `county`, `StateCounty`, `state`** – Geographic identifiers.
- **`year`, `month`** – Temporal identifiers (month included only in monthly tables).
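
The definitions above imply some relationships between fields that can serve as a sanity check after loading. The sketch below assumes that `salience` equals `validtweets / ntweets` up to rounding. That is an inference from the field definitions, not a formula this document states. The helper name `check_counts` and the tolerance are illustrative.

```python
def check_counts(row, tol=0.001):
    """Return a list of inconsistencies found in one parsed row (a dict)."""
    problems = []
    ntweets = row["ntweets"]
    valid = row["validtweets"]

    # Stated in the schema: natweets = ntweets - validtweets.
    if row["natweets"] != ntweets - valid:
        problems.append("natweets != ntweets - validtweets")

    # Stated in the schema: stat lies in [-1, +1] (it may be missing).
    stat = row.get("stat")
    if stat is not None and not -1.0 <= stat <= 1.0:
        problems.append("stat outside [-1, +1]")

    # Assumption: salience is validtweets / ntweets, rounded to 3 decimals.
    salience = row.get("salience")
    if salience is not None and ntweets > 0:
        if abs(salience - valid / ntweets) > tol:
            problems.append("salience differs from validtweets / ntweets")

    return problems


ok_row = {"ntweets": 1000, "validtweets": 125, "natweets": 875,
          "stat": 0.3, "salience": 0.125}
bad_row = dict(ok_row, natweets=800)
```

An empty list means the row passed every check. Here `check_counts(bad_row)` reports only the `natweets` mismatch.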

---

## 🧩 Schema Overview

### County–Year

| Field | Type | Description |
|:------|:-----|:-------------|
| `variable` | string | Flourishing dimension name. |
| `stat` | float | Indicator in [-1,+1]. |
| `stat_se` | float | Standard deviation (optional). |
| `salience` | float | Fraction of tweets expressing the dimension (0–1). |
| `ntweets` | int | Total tweets for the county–year cell. |
| `validtweets` | int | Tweets where the dimension is present. |
| `natweets` | int | Tweets where the dimension is absent. |
| `FIPS` | string | 2-digit state FIPS. |
| `county` | string | 3-digit county code. |
| `StateCounty` | string | Combined 5-digit county FIPS. |
| `year` | int | Year of aggregation. |

---

### County–Month

| Field | Type | Description |
|:------|:-----|:-------------|
| `variable` | string | Flourishing dimension name. |
| `stat` | float | Monthly indicator in [-1,+1]. |
| `stat_se` | float | Standard deviation (optional). |
| `salience` | float | Fraction of tweets expressing the dimension (0–1). |
| `ntweets` | int | Total tweets for the county–month cell. |
| `validtweets` | int | Tweets where the dimension is present. |
| `natweets` | int | Tweets where the dimension is absent. |
| `FIPS` | string | 2-digit state FIPS. |
| `county` | string | 3-digit county code. |
| `StateCounty` | string | Combined 5-digit county FIPS. |
| `year` | int | Year of aggregation. |
| `month` | int | Month of aggregation (1–12). |

---

### State–Year

| Field | Type | Description |
|:------|:-----|:-------------|
| `variable` | string | Flourishing dimension name. |
| `stat` | float | Indicator in [-1,+1]. |
| `stat_se` | float | Standard deviation (optional). |
| `salience` | float | Fraction of tweets expressing the dimension (0–1). |
| `ntweets` | int | Total tweets for the state–year cell. |
| `validtweets` | int | Tweets where the dimension is present. |
| `natweets` | int | Tweets where the dimension is absent. |
| `FIPS` | string | 2-digit state FIPS. |
| `state` | string | State name or postal abbreviation (if present). |
| `year` | int | Year of aggregation. |

---

### State–Month

| Field | Type | Description |
|:------|:-----|:-------------|
| `variable` | string | Flourishing dimension name. |
| `stat` | float | Monthly indicator in [-1,+1]. |
| `stat_se` | float | Standard deviation (optional). |
| `salience` | float | Fraction of tweets expressing the dimension (0–1). |
| `ntweets` | int | Total tweets for the state–month cell. |
| `validtweets` | int | Tweets where the dimension is present. |
| `natweets` | int | Tweets where the dimension is absent. |
| `FIPS` | string | 2-digit state FIPS. |
| `state` | string | State name or postal abbreviation (if present). |
| `year` | int | Year of aggregation. |
| `month` | int | Month of aggregation (1–12). |

---

## 💾 File Formats

All files are distributed in both **CSV** (UTF-8, comma-separated) and **Parquet** formats.

- The **CSV** versions provide maximum accessibility.
- The **Parquet** versions preserve column types and enable efficient analytical queries (e.g., via DuckDB).
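
CSV carries no type information, so a reader has to keep identifiers as strings and convert the numeric columns itself. The sketch below does this with the standard `csv` module and treats empty cells and `NA` as missing. The in-memory sample, its column order, and the helper names are illustrative assumptions. The real files may order columns differently.

```python
import csv
import io

INT_COLS = {"ntweets", "validtweets", "natweets", "year", "month"}
FLOAT_COLS = {"stat", "stat_se", "salience"}
MISSING = {"", "NA"}  # missing-value encodings described in this document


def parse_row(raw):
    """Convert numeric columns; leave identifiers (FIPS etc.) as strings."""
    row = {}
    for name, text in raw.items():
        if name in INT_COLS or name in FLOAT_COLS:
            if text is None or text.strip() in MISSING:
                row[name] = None
            elif name in INT_COLS:
                row[name] = int(text)
            else:
                row[name] = float(text)
        else:
            row[name] = text  # keeps leading zeros such as "01"
    return row


def read_flourishing_csv(handle):
    """Read an open text handle into a list of parsed row dicts."""
    return [parse_row(raw) for raw in csv.DictReader(handle)]


# Illustrative sample in the State-Year layout (values are made up).
sample = io.StringIO(
    "variable,stat,stat_se,salience,ntweets,validtweets,natweets,FIPS,state,year\n"
    "happiness,0.412,NA,0.125,1000,125,875,01,AL,2015\n"
    "optimism,,,0.000,1000,0,1000,01,AL,2015\n"
)
rows = read_flourishing_csv(sample)
```

To read a real file, pass a handle opened with `open("flourishingStateYear.csv", newline="", encoding="utf-8")` instead of the in-memory sample.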

---

## 🧠 Notes

- Indicators are **conditional means** over tweets where each dimension is detected; zeros used internally in the pipeline are excluded from denominators.  
- `salience` represents *relative prevalence*, not cognitive or linguistic salience.  
- All numeric values are rounded to at most three decimal places.  
- Missing values are encoded as empty cells or `NA`.
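
Because `stat` is a conditional mean, cells cannot be combined with a plain average. Pooling several cells requires weighting each `stat` by its `validtweets`. The sketch below illustrates that arithmetic with made-up values. It is not confirmed to be how the yearly or state-level files were produced, and rounding to three decimals means a pooled value may differ slightly from a published one. The helper name `pool_stat` is hypothetical.

```python
def pool_stat(cells):
    """Pool conditional means across cells, weighting by validtweets.

    Cells with a missing stat or no valid tweets contribute nothing.
    Returns None when no cell has usable data.
    """
    total_valid = 0
    weighted_sum = 0.0
    for cell in cells:
        if cell["stat"] is None or not cell["validtweets"]:
            continue
        weighted_sum += cell["stat"] * cell["validtweets"]
        total_valid += cell["validtweets"]
    return weighted_sum / total_valid if total_valid else None


# Illustrative monthly cells for one unit and one dimension.
months = [
    {"stat": 0.5, "validtweets": 100},
    {"stat": -0.1, "validtweets": 300},
    {"stat": None, "validtweets": 0},   # dimension never detected this month
]
pooled = pool_stat(months)
```

Here the pooled value is (0.5 × 100 − 0.1 × 300) / 400 = 0.05, whereas an unweighted average of the two monthly means would give 0.2.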

---

*Citation:*  
TBD
